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Abstract. We present numerical simulations of galaxy formation, one 
of the most challenging problems in computational astrophysics. The key 
point in such simulations is the efficient solution of the N-body problem. 
If the gas of a galaxy is treated by means of smoothed particle hydro- 
dynamics (Sph), the hydrodynamic equations can be reduced to a form 
similar to that of the N-body problem. A straightforward implementa- 
tion requires a computational effort oc N 2 , making it prohibitively ex- 
pensive to simulate systems larger than 10 6 particles even on the largest 
available supercomputers. 

After a description of the physical and numerical problems, we shortly 
review the standard numerical methods to tackle these problems and 
discuss their advantages and drawbacks. We also present a completely 
different approach to perform such simulations using a workstation in 
combination with the special purpose hardware GRAPE. After a discus- 
sion of the main features of GRAPE, we present a simple implementation 
of a SPH-N-body code on such a configuration. Comparing results and 
performance of these two approaches, we show, that with an investment 
of US $ 50000, the problem can be solved up to 5 times faster than on a 
Cray YMP. 

1 Introduction 

During the last decade one of the most active fields of astrophysics has been 
the study of the formation of galaxies. It is commonly believed, that up to 95% 
of the matter of the universe is composed of dark matter, which is probably of 
non-baryonic origin and interacts mainly via gravitation. According to a widely 
accepted idea galaxies form by the collapse of gravitationally unstable primor- 
dial density fluctuations. The evolution of the gravitationally dominant dark 
matter is treated by N-body techniques. During the last years, one has begun 
to add gasdynamics to the simulations to mimic the evolution of the baryons. In 
some simulations, stars and galaxies are formed out of the collapsing gas, which 
are again treated by N-body techniques. The grand computational challenge 
of such simulations is twofold: Firstly, one can show [1], that the gravitational 
collapse must proceed anisotropically, i.e., three-dimensional simulations are 
necessary. Secondly, very different length scales are involved, starting from 1 pc 
(« 3.1 10 13 km), which is the size of a typical star forming region, up to several 
hundred Mpc, a volume which can be considered to be a representative piece of 



our universe. This range of scales must be compared with the largest simulations 
feasible nowadays, which are performed on a grid of about 300 3 zones. Besides 
classical finite-difference methods, a completely different approach to solve the 
hydrodynamic equations is used in the astrophysical community: Smoothed Par- 
ticle Hydrodynamics (Sph, [2]). Its main advantage is to be a free Lagrangean 
method. This makes it optimally suited for highly irregular clustered systems 
like galaxies. Although it is still to prove mathematically that the Sph equations 
converge to the hydrodynamical fluid equations, a series of test calculations has 
shown that the quality of Sph results can compete with that of modern finite 
difference schemes [3], even with suprisingly small particle numbers! 

2 Current techniques 

The key problem to perform large scale computer simulations of structure for- 
mation is an efficient solution of the N-body problem, i.e., the calculation of 
the acceleration of the particle i due to the gravitational interaction with 
all other particles j of the system: 



with nj = |r; — Tj\. s is the so called softening parameter, which prevents 
the 1/r 2 divergence of the force for r — > and limits the resolution. Methods 
which directly solve the system (1) are called Particle-Particle (PP) methods 
[4]. Because gravity is a long range force, the computational effort to determine 
the force on all N particles grows oc N 2 . This makes it prohibitively expensive to 
perform simulations involving much more than 10 4 particles, even on the fastest 
available supercomputers. In Sph, the force law is of similar form, it is given by 



In this equations, W is the interpolation kernel with a shape similar to a Gaus- 
sian, h is the smoothing length. dvi/dt\ grav is the gravitational acceleration ac- 
cording to Eq. (1). Besides the particle number, h determines the resolution of 
the system. The pressure P, the internal energy e and the density q are re- 
lated by an equation of state. Qij is an artificial viscosity introduced to treat 
shock waves. Note, that as long as the kernel W has compact support, all the 
corrections of Eqs (2) to Eq. (1) are of short range nature, i.e., the additional 
computational effort is only oc N. 
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In the past, several techniques have been developed to circumvent the N 2 
behaviour: Particle-mesh methods (PM) do not explicitly solve Eq. (1). Instead, 
the distribution of particles is assigned to a grid, the mass per zone defining a 
density. Via Fast Fourier Transform (FFT), Poisson's equation is solved on the 
grid. The forces are than interpolated to the particle position. The computational 
effort grows only oc N log M, where M is the number of zones per dimension. 
Though PM schemes are very fast, they are only suitable for relatively homoge- 
neous systems. The resolution of a PM calculation is determined by the size of 
a zone, and even the largest current PM simulations with 500 3 zones and 250 3 
particles have only a very limited spatial resolution. One tries to circumvent 
these problems by the P 3 M (= PPPM) technique. Here, the long range forces 
are calculated via the PM method, but the short range forces exerted by particles 
in the same or in the neighbouring zones are treated by the PP technique. The 
main drawback of P 3 M is that for highly clumped structures, a large number of 
particles is placed within a few zones, and the PP part becomes computationally 
dominant. For very large simulations (200 3 ) particle numbers exceeding 10 4 per 
zone are not atypical. The computational effort of P 3 M is difficult to estimate: 
for a homogeneous system it is oc NlogM as for PM, in the worst case, most 
of the paricles are located in a few cells, the performance is degraded to the N 2 
behaviour of the PP method. Another approach is the tree algorithm [5], [6]. 
Here, the main idea is to group distant particles together and to approximate 
the force exerted by this group by that of one particle of the same mass. A tree 
data structure is used to systematically group particles together. Comparing the 
extension of the group with its distance to a specific particle, one can determine 
whether this force approximation is accurate enough, or whether the group has 
to be split into subgroups, for which the same procedure is applied. The result is, 
that instead of N only oc log N interactions are to be calculated for every parti- 
cle. Therefore, the computational effort scales like NlogN. The performance of 
tree methods also decreases with increasing dumpiness, although much weaker 
than P 3 M methods. However, the construction of the tree causes some overhead, 
which may become critical in a multiple timestep scheme, if only the force for a 
few particles has to be calculated. Finally, tree algorithms are relatively complex 
and require a lot of memory. 

There exist various implementations of Sph and N-body codes on vector com- 
puters, but only very few on massively parallel machines (e.g. [7], [8], [9], [10]). 
Therefore, only very vague statements can be made about their performance. 
To implement very efficient PP codes on vector or shared memory machines is 
quite easy, but on a distributed memory machine this task is not unproblem- 
atic. For PM and P 3 M the FFT part can be handled efficiently. However, the 
mesh assignment of particles and the force interpolation are difficult to vectorize 
and involve a lot of indirect addressing. In case of parallel machines it is diffi- 
cult to achieve a good load balancing for the mesh asignment and interpolation 
step. The recursive structure of a tree algorithm is difficult to vectorize, a lot 
of indirect addressing combined with short vector lengths is involved. By area 
decomposition, it is possible to run a tree code on a distributed memory machine 



with high performance [7], but major changes to the standard tree code have to 
be made. The currently largest N-body simulations (260 3 ) were done with such 
a code [11]. However, we think it is difficult to get a good load balancing if a 
multiple time step scheme is used, which is essential for a good performance of 
a Sph code. 

In summary, there exist different techniques to tackle the N-body system 
with good performance on current supercomputers, although it is difficult to 
come close to the peak performance. For a good compromise between speed and 
resolution, P 3 M and tree codes are the favourite choices. Simulations involving 
300 3 particles are the current limit. In combination with Sph, the respective 
numbers are much smaller and even the largest simulations involve only a few 
times 10 5 particles. The main reasons are: (i) The force calculation becomes 
more expensive, (ii) The particles have an extension, which increases the num- 
ber of short range force evaluations, (iii) The number of time steps is 10-100 
times higher. The resources necessary to perform a typical simulation of the for- 
mation of galaxies (timings for a tree code on one Cray YMP processor) are 
the following: A 4000 particle N-body simulation (« 1000 timesteps) requires 40 
min. The same simulation using a two component system of 4000 Sph and 4000 
N-body particles (15000 timesteps) requires about 10 h, and a simulation with 
4000 Sph, 4000 N-body, and at the end about 25000 star particles (N-body) 
needs 60 hours. Simulations with 32 3 gas and dark matter particles as performed 
by [12] have consumed more than 200 Cray hours. Calculations with a million 
particles of two or three different species would consume several thousands of 
Cray hours. Finally note, that although these calculations may be regarded 
as grand challenges for future generations of supercomputers, from the physical 
point of view they still have only a very limited resolution! 

3 The GRAPE Project 

Up to now, all methods are based on software development. A completely dif- 
ferent approach was chosen by Sugimoto and collaborators at the University of 
Tokyo (for an overview see [13]): The calculation of one force interaction is a 
combination of a very few specific arithmetic operations: three differences, three 
squares, one sum, etc. Then, the inter particle forces have to be summed up. 
Furthermore, many of these operations do not depend on each other and can be 
done in a pipeline. Since the number of operations to get the total force on one 
particle is the same for every particle, one can parallelize it with a very good load 
balancing. Sugimoto et al. have designed a series of special purpose hardware 
boards Grape (GRAvity PipE) to calculate (1). Furthermore, a list of particles 
within a sphere of a given radius hi of particle i is returned, too. This is very 
helpful for an implementation of Sph. The board is connected to a workstation 
via a VME interface. Libraries allow one to use Grape by FORTRAN or C sub- 
routine calls. The prototype GrapeI reached 240 Mflops in 1989. Meanwhile 
there exist two series of boards: the odd numbers (GrapeI, Grape3) are low 
precision boards (18 bit or ss 1% accuracy in the force), which are sufficient for 



most astrophysical applications. The machines with even numbers (GRAPE2 and 
GRAPE4) are working with 32 and 64 bit arithmetic and are designed to calcu- 
late molecular systems and specific stellar dynamical problems. In GRAPE3, one 
GrapeIA board is put into a customized LSI chip. Presently, GRAPE3Af, which 
consists of 8 such chips, is produced in a small series and is available for about 
US $ 20000. Its peak performance is 4.8 Gflops. Up to 16 boards can be put 
together to work in parallel. Up to 1995 the GRAPE4 project should be finished. 
In GRAPE4, the GRAPE2 board is put into a LSI chip and 1500 of such chips are 
combined together. This board will reach a performance in the Teraflop regime. 
Although the flop rate of Grape is very impressive, one should keep in mind 
that PP techniques have a much larger operation count to calculate the force on 
a particle than the approximative techniques mentioned above. Furthermore, in 
hydrodynamic simulations, a non-negligible part of the computational time is 
necessary to calculate the pressure force and the equation of state. 

In a series of publications the Tokyo group has shown that it is possible to 
perform large N-body simulations on such a board with a speed close to its 
peak performance. Furthermore, a Sph code was implemented. In such a code, 
the gravitational force and the neighbour list was obtained with Grape, the 
evaluation of the hydrodynamic force and the solution of the equation of state 
being done on a workstation. A tree code was implemented on Grape, too. 
Again, the force evaluation is done on the board, but the tree construction and 
the determination of the interaction list has to be done on the host. Thus, a 
powerful workstation is essential, in order to use a Sph and/or a tree code with 
Grape efficiently. It should also be no serious problem to implement a P 3 M on 
Grape: The PM part is done on the host, the PP part on Grape. 

4 Results 

The following comparison holds for a multiple timestep SPH-N-body tree code 
[3] written and optimized to run on a Cray. All CPU timings are given for 
one processor of a Cray YMP 4/64 (333 Mflops). The timings for Grape are 
obtained on one single Grape 3Af board (8 LSI chips, 4.8 Gflops). The host 
is a SPARC 10 clone (« 15 Mflops). The unchanged tree code runs about 20 
times slower on the SPARC10 than on the Cray. Using Grape the main code 
structure remains unchanged, only the subroutines for the force calculation and 
the neighbour list are replaced by the Grape routines, i.e., we compare a tree 
code on the Cray with a PP code on Grape. To accelerate the computations 
on the workstation, REAL*4 arithmetic is used whenever possible. Only little 
effort was spent to optimize the host calculation for the Sun. Comparing N- 
body simulations one should keep in mind that the N-body system is chaotic. 
Thus, it is not possible to compare position, velocities and other properties of 
specific particles, but only the structure and kinematic of the whole system. 

To become familiar with Grape and to adapt the N-body code required 
only two days. A 4000 (33000) body simulation requires 40min (10 h) on the 
Cray. The same result was obtained with Grape in 9min (1.7h). More than 



80% of the calculations are done on the board. In both cases, the system ends 
in a highly clustered state, which is advantageous for Grape. In the case of a 
64 3 calculation, the Cray is about two times faster for a moderately clustered 
system, in the case of a highly clustered system two times slower. The break 
even point between PP on Grape and tree on Cray is of the order of a few 10 5 
particles. The tree code requires 220 MB of memory, the GRAPE only 50 MB 
(REAL*4), i.e., one can perform the same simulation without any problem on 
a mid class workstation. Running the same code on two boards gives a speed 
up of 1.1, 1.5 and 1.9 for 4000, 33000 and 64 3 particles. A multi board version 
combined with a fast workstation would allow simulations with several million 
particles within a few days. 

In the last paragraph we discussed the performance of Grape for a well suited 
problem. Even more interesting is its performance for more general problems, 
which only partially exploit the special features of Grape. In the Cray code, 
the computing time for the Sph part is about 10%, i.e., the problem is compu- 
tationally dominated by gravitation. In contrast to the previous problem, some 
changes in the algorithms are necessary before Sph runs efficiently, but the effort 
for these changes is negligible compared to the effort necessary to run the code 
on a parallel platform. The resulting code requires about 7 hours for a simulation 
with 4000 Sph and 4000 dark matter particles. This is about 1.3 times faster than 
on the Cray. About 80% of the computations were done on the workstation. 
Replacing the relatively slow SPARC 10 by a faster one, a speedup of a factor 2 
or even more should be easily possible. The behaviour should be even better for 
larger particle numbers, because the performance on Grape is limited by the 
Sph part, which grows oc N, whereas the performance on Cray is limited by the 
gravitational force calculation which grows like NlogN. Furthermore, judging 
from our experience the maximum possible simulation on Grape will not be 
limited by the N 2 behaviour of Grape for large particle numbers but rather 
by a shortage of memory and I/O operations on the workstation. Therefore, 
we believe that it makes no sense to use more than one board for such simu- 
lations. In another simulation, we have taken star formation into account, i.e., 
in gravitationally unstable regions, some mass is removed from the gas particles 
and put into new collisionless star particles. Consequently, during a simulation 
the number of N-body particles continuously grows. Typically, 10000-25000 star 
particles are formed from 4000 gas particles [14]. Because all new particles are 
located in the densest regions, the degree of clustering increases, which results in 
a larger computing time on a Cray (about 60 hours). On Grape the computing 
time is still limited by the hydrodynamic part, which does not depend on the 
number of star particles. Therefore, the computing time grows only moderately 
to about 20 hours, i.e., Grape is three times faster. 

5 Conclusion 

We have presented a highly active and interesting field of current astrophysical 
research, namely structure formation in the universe. We have shown that the 



scientific progress in this field is tightly coupled to the capabilities of the available 
supercomputers. As an alternative, a combination of a workstation with the 
special purpose hardware Grape can solve problems of comparable size within a 
time comparable to that on a supercomputer like a CRAY. The Grape hardware 
is applicable to N-body simulations and Sph hydrodynamical simulations the 
power of the host being crucial for a good performance in the latter case. A 
meaningful use of Grape requires a problem dominated by the calculation of 
gravitational forces. The force evaluation on the low precision boards (Grape3) 
is only accurate to « 1%. Numerical simulations which demand a higher precision 
should be run on the high precision boards (Grape4). In that case, however, an 
investment of about US $ 100 000 is necessary to get a similar performance. In 
summary, the Grape hardware is applicable to a whole class of astrophysical 
problems. It is a very attractive alternative to current supercomputers, at least 
for problems which are dominated by CPU time rather than by memory. 
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